101 research outputs found

    Which gene did you mean?

    Get PDF
    Computational Biology needs computer-readable information records. Increasingly, meta-analysed and pre-digested information is being used in the follow up of high throughput experiments and other investigations that yield massive data sets. Semantic enrichment of plain text is crucial for computer aided analysis. In general people will think about semantic tagging as just another form of text mining, and that term has quite a negative connotation in the minds of some biologists who have been disappointed by classical approaches of text mining. Efforts so far have tried to develop tools and technologies that retrospectively extract the correct information from text, which is usually full of ambiguities. Although remarkable results have been obtained in experimental circumstances, the wide spread use of information mining tools is lagging behind earlier expectations. This commentary proposes to make semantic tagging an integral process to electronic publishing

    Mining microarray datasets aided by knowledge stored in literature

    Get PDF
    DNA microarray technology produces large amounts of data. For data mining of these datasets, background information on genes can be helpful. Unfortunately most information is stored in free text. Here, we present an approach to use this information for DNA microarray data mining

    Towards the tipping point of FAIR implementation

    Get PDF
    This article explores the global implementation of the FAIR Guiding Principles for scientific management and data stewardship, which provide that data should be findable, accessible, interoperable and reusable. The implementation of these principles is designed to lead to the stewardship of data as FAIR digital objects and the establishment of the Internet of FAIR Data and Services (IFDS). If implementation reaches a tipping point, IFDS has the potential to revolutionize how data is managed by making machine and human readable data discoverable for reuse. Accordingly, this article examines the expansion of the implementation of FAIR Guiding Principles, especially how and in which geographies (locations) and areas (topic domains) implementation is taking place. A literature review of academic articles published between 2016 and 2019 on the use of FAIR Guiding Principles is presented. The investigation also includes an analysis of the domains in the IFDS Implementation Networks (INs). Its uptake has been mainly in the Western hemisphere. The investigation found that implementation of FAIR Guiding Principles has taken firm hold in the domain of bio and natural sciences. To achieve a tipping point for FAIR implementation, is now time to ensure the inclusion of non-European ascendants and of other scientific domains. Apart from equal opportunity and genuine global partnership issues, a permanent European bias poses challenges with regard to the representativeness and validity of data and could limit the potential of IFDS to reach across continental boundaries. The article concludes that, despite efforts to be inclusive, acceptance of the FAIR Guiding Principles and IFDS in different scientific communities is limited and there is a need to act now to prevent dampening of the momentum in the development and implementation of the IFDS. It is further concluded that policy entrepreneurs and the GO FAIR INs may contribute to making the FAIR Guiding Principles more flexible in including different research epistemologies, especially through its GO CHANGE pillar. LIACS-Managemen

    Thesaurus-based disambiguation of gene symbols

    Get PDF
    BACKGROUND: Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck. RESULTS: We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set. CONCLUSION: The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications

    Literature-aided meta-analysis of microarray data: a compendium study on muscle development and disease

    Get PDF
    Background: Comparative analysis of expression microarray studies is difficult due to the large influence of technical factors on experimental outcome. Still, the identified differentially expressed genes may hint at the same biological processes. However, manually curated assignment of genes to biological processes, such as pursued by the Gene Ontology (GO) consortium, is incomplete and limited. We hypothesised that automatic association of genes with biological processes through thesaurus-controlled mining of Medline abstracts would be more effective. Therefore, we developed a novel algorithm (LAMA: Literature-Aided Meta-Analysis) to quantify the similarity between transcriptomics studies. We evaluated our algorithm on a large compendium of 102 microarray studies published in the field of muscle development and disease, and compared it to similarity measures based on gene overlap and over-representation of biological processes assigned by GO. Results: While the overlap in both genes and overrepresented GO-terms was poor, LAMA retrieved many more biologically meaningful links between studies, with substantially lower influence of technical factors. LAMA correctly grouped muscular dystrophy, regeneration and myositis studies, and linked patient and corresponding mouse model studies. LAMA also retrieves the connecting biological concepts. Among other new discoveries, we associated cullin proteins, a class of ubiquitinylation proteins, with genes down-regulated during muscle regeneration, whereas ubiquitinylation was previously reported to be activated during the inverse process: muscle atrophy. Conclusion: Our literature-based association analysis is capable of finding hidden common biological denominators in microarray studies, and circumvents the need for raw data analysis or curated gene annotation databases

    Ambiguity of human gene symbols in LocusLink and MEDLINE: creating an inventory and a disambiguation test collection

    Get PDF
    Genes are discovered almost on a daily basis and new names have to be found. Although there are guidelines for gene nomenclature, the naming process is highly creative. Human genes are often named with a gene symbol and a longer, more descriptive term; the short form is very often an abbreviation of the long form. Abbreviations in biomedical language are highly ambiguous, i.e., one gene symbol often refers to more than one gene.Using an existing abbreviation expansion algorithm,we explore MEDLINE for the use of human gene symbols derived from LocusLink. It turns out that just over 40% of these symbols occur in MEDLINE, however, many of these occurrences are not related to genes. Along the process of making an inventory, a disambiguation test collection is constructed automatically

    Co-occurrence based meta-analysis of scientific texts: retrieving biological relationships between genes

    Get PDF
    MOTIVATION: The advent of high-throughput experiments in molecular biology creates a need for methods to efficiently extract and use information for large numbers of genes. Recently, the associative concept space (ACS) has been developed for the representation of information extracted from biomedical literature. The ACS is a Euclidean space in which thesaurus concepts are positioned and the distances between concepts indicates their relatedness. The ACS uses co-occurrence of concepts as a source of information. In this paper we evaluate how well the system can retrieve functionally related genes and we compare its performance with a simple gene co-occurrence method. RESULTS: To assess the performance of the ACS we composed a test set of five groups of functionally related genes. With the ACS good scores were obtained for four of the five groups. When compared to the gene co-occurrence method, the ACS is capable of revealing more functional biological relations and can achieve results with less literature available per gene. Hierarchical clustering was performed on the ACS output, as a potential aid to users, and was found to provide useful clusters. Our results suggest that the algorithm can be of value for researchers studying large numbers of genes. AVAILABILITY: The ACS program is available upon request from the authors

    Using contextual queries

    Get PDF
    Search engines generally treat search requests in isolation. The results for a given query are identical, independent of the user, or the context in which the user made the request. An approach is demonstrated that explores implicit contexts as obtained from a document the user is reading. The approach inserts into an original (web) document functionality to directly activate context driven queries that yield related articles obtained from various information sources

    Calling on a million minds for community annotation in WikiProteins.

    Get PDF
    WikiProteins enables community annotation in a Wiki-based system. Extracts of major data sources have been fused into an editable environment that links out to the original sources. Data from community edits create automatic copies of the original data. Semantic technology captures concepts co-occurring in one sentence and thus potential factual statements. In addition, indirect associations between concepts have been calculated. We call on a 'million minds' to annotate a 'million concepts' and to collect facts from the literature with the reward of collaborative knowledge discovery. The system is available for beta testing at http://www.wikiprofessional.org.RIGHTS : This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may : copy, distribute, and display the work; make derivative works; or make commercial use of the work - under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are

    Open Science for a Global Transformation

    Get PDF
    This is the final version. Available on open access via the DOI in this recordEuropean CommissionAlan Turing Institut
    • …
    corecore